Skip to content

fix: Correctly limit pushed dataset items in PPE-aware mode#570

Merged
janbuchar merged 5 commits intomasterfrom
fix-push-data-ppe
Mar 9, 2026
Merged

fix: Correctly limit pushed dataset items in PPE-aware mode#570
janbuchar merged 5 commits intomasterfrom
fix-push-data-ppe

Conversation

@janbuchar
Copy link
Contributor

@janbuchar janbuchar commented Mar 2, 2026

@janbuchar janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label Mar 2, 2026
@github-actions github-actions bot added this to the 135th sprint - Tooling team milestone Mar 2, 2026
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Mar 2, 2026
@janbuchar janbuchar added the adhoc Ad-hoc unplanned task added during the sprint. label Mar 2, 2026
{
pricingModel: 'PAY_PER_EVENT',
pricingPerEvent: {
actorChargeEvents: {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do one more test where apify-default-dataset-item is in the pricing. The question is what SDK should do if user has that and does pushData(data, 'my-custom-event') and has both events defined. I think the correct is to charge both apify-default-dataset-item and my-custom-event

unknown
>,
> extends DatasetClient<Data> {
private normalizeItems(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to look for a way we could skip this extra JSON.parse but I don't have anything on my mind. I'm bit worried about potential perf implications here. But we can do it with a TODO later (and test first).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well worry no more, here's an actual performance problem 🙂 https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L285-L285 - if we pass an array of objects to the dataset client, it will run it through a much slower validation code path. I wonder if that's the reason we bother with the stringification.

I guess we can just re-stringify the items before calling super.pushItems though, right?

Copy link
Member

@metalwarrior665 metalwarrior665 Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh so that's another thing but it was there probably since inception.

I think the stringification in Crawlee storage is simply that we already have to do it anyway to check the item size to decide if/how we need to chunk it so it fits the 9MB API limit. And since we already have it stringified, we send it along because otherwise the HTTP client does that.

In ideal case, there is just a single JSON.stringify. With your PR, we now have stringify -> parse -> stringify. But maybe it would need deeper refactor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One ugly perf workaround would be to only do this parse/stringify in case of apify-default-dataset-item is present, otherwise we don't need to use the intercepted client. Then it will hit smaller portion of Actors. Just an option, not convinced it is worth it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually like that idea, let's go with that.

@janbuchar janbuchar marked this pull request as ready for review March 6, 2026 12:24
@janbuchar janbuchar requested a review from B4nan March 6, 2026 12:24
@B4nan B4nan force-pushed the fix-push-data-ppe branch from 84c84db to 7f4e892 Compare March 6, 2026 22:54
@janbuchar janbuchar merged commit 47c58d9 into master Mar 9, 2026
20 checks passed
@janbuchar janbuchar deleted the fix-push-data-ppe branch March 9, 2026 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants